Pre-Processing Closed Captions For Machine Translation
Abstract
We describe an approach to Machine Translation of transcribed speech, as found in closed captions. We discuss how the colloquial nature and input format peculiarities of closed captions are dealt with in a pre-processing pipeline that prepares the input for effective processing by a core MT system. In particular, we describe components for proper name recognition and input segmentation, and we evaluate the contribution of these modules to system performance. The described methods have been implemented in an MT system for translating English closed captions into Spanish and Portuguese.

1 Introduction

Machine Translation (MT) technology can be embedded in a device that performs real-time translation of the closed captions included in TV signals. While speed is one factor in the construction of such a device, another is the type and format of the language involved. The challenges that closed captions pose for MT can be attributed to three distinct characteristics.

Firstly, closed captions are transcribed speech. Although closed captions are not a completely faithful transcription of TV programs, they render spoken language, and the language used is therefore typically colloquial (Nyberg and Mitamura, 1997). They contain many of the phenomena that characterize spoken language: interjections, repetitions, stuttering, ellipsis, interruptions, and hesitations. Linguistically and stylistically they differ from written language: sentences are shorter and poorly structured, and contain idiomatic expressions, ungrammaticality, etc. The associated difficulties stem from the inherently colloquial nature of closed captions and, to different degrees, of all forms of transcribed speech (Hindle, 1983). Such difficulties call for a different approach than is taken for written documents.

Secondly, closed captions come in a specific format, which poses problems for their optimal processing.
Closed captioners often split a single utterance between two screens when the character limit for a screen has been exceeded. The split is based on considerations of string length rather than linguistic structure, and can therefore fall at non-constituent boundaries (see Table 1), making the real-time processing of the separate segments problematic. Another problem is that captions have no upper/lower-case distinction. This poses challenges for proper name recognition, since names cannot be identified by an initial capital; nor can we rely on an initial upper-case letter to identify a sentence-initial word. This problematic aspect sets the domain of closed captions apart from most text-to-text MT domains, making it more akin, in this respect, to speech translation systems. Although such input format characteristics could, from a technical point of view, be amended, they are most likely not under a developer's control and hence have to be taken as given.

Thirdly, closed captions are used under operational constraints. Users have no control over the speed of the image or caption flow, and must therefore comprehend a caption in the limited time it appears on the screen. Accordingly, the translation of closed captions is a "time-constrained" application, where the user has limited time to comprehend the system output. Hence, an MT system should produce translations comprehensible within the limited time available to the viewer.

In this paper we focus on the first two factors, as the third has been discussed in (Toole et al., 1998). We discuss how such domain-
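The two format problems above (utterances split across screens on string length alone, and the absence of case information for name recognition) can be illustrated with a minimal sketch. The heuristics and names below are illustrative assumptions, not the components described in the paper: segments lacking sentence-final punctuation are merged with their successor, and candidate names are found by case-insensitive lookup in a toy lexicon.

```python
# Hypothetical sketch of two caption pre-processing steps; the merge
# heuristic and the name lexicon are illustrative, not the paper's actual
# components.

NAME_LEXICON = {"springfield", "homer"}  # toy all-lowercase name list

def merge_split_captions(segments):
    """Merge consecutive caption segments that appear to continue a single
    utterance: a segment not ending in sentence-final punctuation is joined
    with the next one (a crude cue, since captioners split on string length
    rather than at constituent boundaries)."""
    merged, buffer = [], ""
    for seg in segments:
        buffer = (buffer + " " + seg).strip()
        if buffer.endswith((".", "?", "!")):
            merged.append(buffer)
            buffer = ""
    if buffer:  # flush a trailing unterminated utterance
        merged.append(buffer)
    return merged

def tag_proper_names(tokens):
    """Captions carry no case information, so names cannot be spotted by an
    initial capital; instead, look each token up case-insensitively."""
    return [(tok, tok.lower() in NAME_LEXICON) for tok in tokens]

print(merge_split_captions(["WELL, I THINK WE", "SHOULD GO HOME NOW."]))
print(tag_proper_names("HOMER WENT HOME".split()))
```

A real system would need stronger cues than final punctuation (captions are often unpunctuated) and a name recognizer rather than a static list, but the sketch shows why both steps must precede the core MT engine.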
Related Papers
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
We describe a system to rapidly generate high-quality closed captions and subtitles for live broadcasted TV shows, using automated components, namely Automatic Speech Recognition and Machine Translation. The human stays in the loop for quality assurance and optional postediting. We also describe how the system feeds the human edits and corrections back into the different components for improvem...
Explanation-based Learning for Machine Translation
In this paper we present an application of explanation-based learning (EBL) in the parsing module of a real-time English-Spanish machine translation system designed to translate closed captions. We discuss the efficiency/coverage trade-offs available in EBL and introduce the techniques we use to increase coverage while maintaining a high level of space and time efficiency. Our performance resul...
A Real-time MT System for Translating Broadcast Captions
This presentation demonstrates a new multi-engine machine translation system, which combines knowledge-based and example-based machine translation strategies for realtime translation of business news captions from English to German.
STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset
In recent years, automatic generation of image descriptions (captions), that is, image captioning, has attracted a great deal of attention. In this paper, we particularly consider generating Japanese captions for images. Since most available caption datasets have been constructed for English language, there are few datasets for Japanese. To tackle this problem, we construct a large-scale Japane...
Efficient parsing strategies for syntactic analysis of closed captions
We present an efficient multi-level chart parser that was designed for syntactic analysis of closed captions (subtitles) in a real-time Machine Translation (MT) system. In order to achieve high parsing speed, we divided an existing English grammar into multiple levels. The parser proceeds in stages. At each stage, rules corresponding to only one level are used. A constituent pruning step is add...
Publication date: 2000